Day 22: Information Entropy and Cross-entropy in NLP, Hands-on Implementation

2022 iThome Ironman Contest

DAY 22

AI & Data

Linguistics and NLP series, part 22

14th Ironman Contest # decision tree # entropy # python # r

cjom06991

Team KnULPers_from_NCCU

2022-10-07 22:12:42


Today we use the iris data set to show how to compute entropy in Python and in R.

Python Calculate Entropy

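There are many ways to compute entropy. In Python you can define your own function, or you can import entropy directly from the scipy.stats package. The results can differ slightly depending on the method used (for example, the logarithm base). To keep this demonstration short, we use the ready-made function.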
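As a rough illustration of the "define your own function" route, here is a minimal sketch of Shannon entropy in plain Python. The name `shannon_entropy` and its `base` parameter are made up for this example. Like `scipy.stats.entropy`, it normalizes the input so that it sums to 1 and uses the natural logarithm unless a base is given.

```python
import math

def shannon_entropy(weights, base=None):
    """Shannon entropy of a list of non-negative weights (hypothetical helper)."""
    total = sum(weights)
    # Normalize so the weights form a probability distribution;
    # zero weights are skipped because 0 * log(0) is taken as 0.
    probs = [w / total for w in weights if w > 0]
    h = -sum(p * math.log(p) for p in probs)
    # Convert from nats to the requested base, if one was given.
    return h / math.log(base) if base is not None else h

print(shannon_entropy([0.5, 0.5], base=2))    # a fair coin: 1 bit
print(shannon_entropy([1, 1, 1, 1], base=2))  # four equally likely outcomes: 2 bits
```

Passing `base=2` reports the result in bits, which is the unit used in the R section of this post.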

First, import the required packages.


import numpy as np
import pandas as pd
from scipy.stats import entropy
from sklearn import datasets  # needed for load_iris() below


# loading required data
iris = datasets.load_iris()
iris = pd.DataFrame(iris.data, columns=iris.feature_names)
# rename the columns to the underscore names used in the rest of the script
iris.rename(columns={'sepal length (cm)': 'Sepal_Length', 'sepal width (cm)': 'Sepal_Width', 'petal length (cm)': 'Petal_Length', 'petal width (cm)': 'Petal_Width'}, inplace=True)
print(iris)
print("\n=======================================\n")

The output is:

[Image: the printed iris DataFrame]


# fill missing values with the column mean

iris['Sepal_Length'] = iris['Sepal_Length'].fillna(iris['Sepal_Length'].mean())
iris['Sepal_Width'] = iris['Sepal_Width'].fillna(iris['Sepal_Width'].mean())
iris['Petal_Length'] = iris['Petal_Length'].fillna(iris['Petal_Length'].mean())
iris['Petal_Width'] = iris['Petal_Width'].fillna(iris['Petal_Width'].mean())


# convert each column to an array

S_length = np.array(iris.Sepal_Length)
print(S_length)
S_width = np.array(iris.Sepal_Width)
print(S_width)
P_length = np.array(iris.Petal_Length)
print(P_length)
P_width = np.array(iris.Petal_Width)
print(P_width)
print("\n=======================================\n")


# build a dictionary to store the results

myDict = {"Sepal Length": entropy(S_length), "Sepal Width": entropy(S_width),
          "Petal Length": entropy(P_length), "Petal Width": entropy(P_width)}

print(myDict)
print("\n=======================================\n")

The output is:

{'Sepal Length': 5.000731141075856, 'Sepal Width': 5.0005878406055295, 'Petal Length': 4.887923095289775, 'Petal Width': 4.7718716614219225}
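These values all sit at or just below ln(150) ≈ 5.01, the maximum entropy for 150 outcomes. That follows from how `scipy.stats.entropy` treats its input: the 150 raw measurements are normalized to sum to 1 and read as one probability per row, and the result is reported in nats. The sketch below imitates that behaviour in plain Python on made-up numbers. `entropy_of_normalized` is a hypothetical helper, not part of SciPy.

```python
import math

def entropy_of_normalized(values):
    """Normalize positive values to sum to 1, then return the entropy in nats."""
    total = sum(values)
    probs = [v / total for v in values]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Made-up measurements, not taken from iris.
similar = [5.0, 5.1, 4.9, 5.0]  # nearly equal values look almost uniform
spread = [0.1, 0.2, 5.0, 6.0]   # unequal values move away from uniform

# Nearly equal values give an entropy close to log(number of values).
print(entropy_of_normalized(similar), math.log(len(similar)))
# More unequal values give a lower entropy.
print(entropy_of_normalized(spread))
```

Read this way, the petal columns score lower than the sepal columns simply because their raw values are less even across rows.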

Draw a decision tree using entropy as the split criterion


from sklearn import tree
from matplotlib import pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target # species

# build the tree (criterion uses the built-in entropy computation)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=4)

clf.fit(X,y)

# plot the tree

fig, ax = plt.subplots(figsize=(6, 6)) # figsize sets the size of the figure
tree.plot_tree(clf,ax=ax,feature_names=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width'])
plt.show()

The output is:

[Image: the plotted decision tree]
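With `criterion='entropy'`, each split is chosen to reduce the entropy of the class labels as much as possible. The following is a minimal sketch of that idea (the information gain of one candidate threshold) on toy data. `label_entropy` and `information_gain` are hypothetical helpers, and this is not scikit-learn's actual implementation.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the class labels in one node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    # Weighted average entropy of the non-empty child nodes.
    children = sum(len(part) / n * label_entropy(part)
                   for part in (left, right) if part)
    return label_entropy(labels) - children

# Toy data: small values belong to class 0, large values to class 1.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 1]

print(information_gain(petal_length, species, threshold=2.0))  # separates the classes completely
print(information_gain(petal_length, species, threshold=1.4))  # leaves one child mixed
```

A tree builder would evaluate many thresholds per feature and keep the one with the largest gain.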

R Calculate Entropy


library(DescTools) # provides the Entropy() function

data("iris") # load the data

Define a function so that several entropies can be computed in one go



calculate_entropy = function(features){
  return(Entropy(table(features), base = 2))
}


features_df = iris[,1:4] # use columns 1 to 4 as features
feature_names = names(features_df)

Write a loop to compute them all in one pass


for (i in seq_along(features_df)) {
  target = features_df[,i]
  target_name = feature_names[i]
  cat(target_name, 'entropy == ', calculate_entropy(target), '\n')
}

The output is:

Sepal.Length entropy == 4.822018
Sepal.Width entropy == 4.023181
Petal.Length entropy == 5.03457
Petal.Width entropy == 4.049827
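Here `table(features)` counts how often each distinct value occurs, so the entropy is taken over the value frequencies and reported in bits (`base = 2`). For readers following along in Python, a rough equivalent can be sketched with `collections.Counter`. The function name `value_entropy_bits` and the sample values are made up for illustration.

```python
import math
from collections import Counter

def value_entropy_bits(values):
    """Entropy in bits of the empirical distribution of distinct values."""
    counts = Counter(values)  # plays the role of R's table()
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up measurements: 5.1 occurs twice, the other two values once each.
sample = [5.1, 4.9, 5.1, 4.7]
print(value_entropy_bits(sample))
```

This measures how varied the observed values are, which is a different quantity from passing the raw measurements themselves to an entropy function.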

Train a decision tree using entropy as the split index



library(rpart)
library(rpart.plot)


data(iris)
iris$Species = factor(iris$Species)

set.seed(123)
train.index <- sample(x=1:nrow(iris), size=ceiling(0.8*nrow(iris) ))
train <- iris[train.index, ]
test <- iris[-train.index, ]

# train the model (note: this fits on the full iris data; pass data = train to hold out the test set)
dtree_model = rpart(Species ~. ,data=iris, parms = list(split = "information")) # "information" means entropy (the default is gini)


rpart.plot(dtree_model)

The output is:

[Image: the decision tree plotted by rpart.plot]